PCAs based on genome composition for 5972 coronavirus spike protein sequences. PCAs are colour-coded and ellipses drawn based on different outcome variables, though underlying PCA for each bias type is the same. Mouseover gives outcome variable and virus name.
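As a rough illustration of what a "genome composition" feature vector might look like, here is a minimal sketch that turns one coding sequence into relative codon frequencies. The function name `codon_frequencies` and the toy sequence are hypothetical, and the actual analysis may have used a different bias measure (e.g. relative synonymous codon usage) rather than raw frequencies.

```python
from collections import Counter
from itertools import product

# All 64 codons in a fixed order, so every sequence yields the same feature layout.
CODONS = ["".join(p) for p in product("ACGT", repeat=3)]


def codon_frequencies(cds):
    """Return the relative frequency of each of the 64 codons in a coding sequence.

    Hypothetical helper: codons containing ambiguous bases (e.g. N) are ignored,
    and any trailing partial codon is dropped.
    """
    cds = cds.upper().replace("U", "T")
    usable = len(cds) - len(cds) % 3
    counts = Counter(cds[i:i + 3] for i in range(0, usable, 3))
    total = sum(counts[c] for c in CODONS)
    if total == 0:
        return {c: 0.0 for c in CODONS}
    return {c: counts[c] / total for c in CODONS}


# Toy sequence invented for illustration (not a real spike gene).
toy = "ATGTTTGTTTTTCTTGTTTAA"
freqs = codon_frequencies(toy)
```

Stacking one such vector per sequence gives the matrix a PCA would then be run on.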
Fairly clear clustering and separation of genera!
Very tight clusters, especially of SARS-CoV and SARS-CoV-2. MERS, SARS, SARS-CoV-2 are actually as distinct from each other as they are from other human CoVs. And many close animal viruses…
Separation seems driven almost entirely by stop codon use, alphas preferring TGA, betas and gammas preferring TAA, deltas somewhere in between.
Epidemic coronaviruses are strongly separated from each other but, again, not far from the other human viruses. Virtually all the human viruses prefer TAA, except HCoV-HKU1, which doesn’t seem fussy.
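One quick way to sanity-check the stop codon pattern would be to tally the terminal codon per genus directly. The sketch below is an assumption about how that could be done; `terminal_stop`, `stop_usage_by_group` and the toy records are hypothetical and contain no real data.

```python
from collections import Counter, defaultdict

STOP_CODONS = {"TAA", "TAG", "TGA"}


def terminal_stop(cds):
    """Return the final codon if the sequence is in frame and ends in a stop, else None."""
    cds = cds.upper().replace("U", "T")
    last = cds[-3:]
    if len(cds) % 3 == 0 and last in STOP_CODONS:
        return last
    return None


def stop_usage_by_group(records):
    """Count terminal stop codons per group from (group, cds) pairs."""
    tally = defaultdict(Counter)
    for group, cds in records:
        stop = terminal_stop(cds)
        if stop is not None:  # skip sequences without a recognisable stop
            tally[group][stop] += 1
    return {group: dict(counts) for group, counts in tally.items()}


# Toy records invented for illustration; group labels mimic genera.
toy_records = [
    ("alpha", "ATGAAATGA"),
    ("alpha", "ATGCCCTGA"),
    ("beta", "ATGAAATAA"),
    ("gamma", "ATGGGGTAA"),
    ("beta", "ATGAAACCC"),  # no terminal stop, so it is skipped
]
usage = stop_usage_by_group(toy_records)
```

A per-genus table like this would make it easy to see whether the TGA/TAA split matches what the loadings suggest.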
Codons contribute much more evenly now, strongest being AGA/CGT (Arginine). Gammas/deltas well separated but not others.
Human viruses still don’t cluster together; this again likely reflects receptor usage, as all SARS-like viruses sit top left.
However, looking at PC3 and PC4 (which explain much less of the overall variation) gives a much better separation between human and non-human viruses, regardless of receptor…?! No particular amino acid loads strongly here. Good signal to capture!
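To look at later components and their loadings, a PCA can be sketched with a plain SVD. This is a minimal sketch on random placeholder data, assuming NumPy is available; the `pca` helper is hypothetical, it centres but does not scale the features, and the original analysis may well have used a different tool or scaling.

```python
import numpy as np


def pca(X):
    """PCA via SVD of the column-centred matrix.

    Returns (scores, loadings, explained), where loadings[k] is the loading
    vector of component k+1 and explained[k] is its share of total variance.
    """
    Xc = X - X.mean(axis=0)
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    explained = S**2 / np.sum(S**2)
    scores = U * S  # projection of each sample onto each component
    return scores, Vt, explained


# Placeholder data: 30 samples x 6 composition features (random, not real).
rng = np.random.default_rng(0)
X = rng.random((30, 6))

scores, loadings, explained = pca(X)

# Coordinates on PC3 and PC4 (zero-based columns 2 and 3).
pc34 = scores[:, 2:4]

# Index of the feature with the largest absolute loading on PC3.
top_feature_pc3 = int(np.argmax(np.abs(loadings[2])))
```

Plotting `pc34` coloured by host, alongside `explained[2:4]`, would show how much variance the human/non-human signal actually accounts for.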
Clusters, but not as clear a separation here.
Surprisingly, the epidemic human coronaviruses are well separated from the endemic human coronaviruses here!